Scientific Progress


Because  software  plays  a  critical  role  in  businesses,  governments,  and  societies,  improving  software  productivity  and  quality  is 
an  important  goal  of  software  engineering.  Mining  software  engineering  data  has  recently  emerged  as  a  promising  means  to 
meet  this  goal  due  to  two  main  trends:  the  increasing  abundance  of  such  data  and  its  demonstrated  helpfulness  in  solving 
numerous real-world problems. Our research on mining software engineering (SE) data this year has advanced this area in the following primary
dimensions. 

(1) To assure high quality and security of software systems, our research has contributed a series of new approaches that
recover semantic information from textual software artifacts. In
particular, our research [ICSE 2012] developed natural language processing techniques to mine code contracts from API
documents. Our research [FSE 2012] developed natural language processing techniques to mine access control policies from
requirements documents. This research is the first of its kind in the research literature.

(2)  To  address  business  requirements  and  to  survive  in  competing  markets,  companies  or  open  source  organizations  often 
have  to  release  different  versions  of  their  projects  in  different  languages.  Manually  migrating  projects  from  one  language  to 
another  (such  as  from  Java  to  C#)  is  a  tedious  and  error-prone  task.  To  reduce  manual  effort  or  human  errors,  tools  can  be 
developed  for  automatic  migration  of  projects  from  one  language  to  another.  However,  these  tools  require  the  knowledge  of 
how  Application  Programming  Interfaces  (APIs)  of  one  language  are  mapped  to  APIs  of  the  other  language,  referred  to  as  API 
mapping  relations. 

Our research [ICSE 10] proposes an approach, called MAM, that is the first to mine API mapping relations from one language to another using API client code. MAM
accepts  a  set  of  projects  each  with  two  versions  in  two  languages  and  mines  API  mapping  relations  between  those  two 
languages  based  on  how  APIs  are  used  by  the  two  versions.  These  mined  API  mapping  relations  assist  in  migration  of  projects 
from  one  language  to  another.  We  implemented  a  tool  and  conducted  two  evaluations  to  show  the  effectiveness  of  MAM.  The 
results  show  that  our  tool  mines  25,805  unique  mapping  relations  of  APIs  between  Java  and  C#  with  more  than  80%  accuracy. 
The results also show that the mined API mapping relations help reduce compilation errors by 54.4% and defects by 43.0% during
migration of projects with an existing migration tool, called Java2CSharp. The reduction in compilation errors and defects is due
to our newly mined mapping relations, which are not available in the existing migration tool.
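The idea of mining mapping relations from client code that exists in two languages can be illustrated with a minimal sketch. It is not the MAM algorithm: it assumes that client methods of the two versions have already been aligned, and it simply counts how often a Java API call and a C# API call co-occur in aligned methods. The function name `mine_api_mappings`, the `min_support` threshold, and the toy data are hypothetical.

```python
from collections import Counter
from itertools import product


def mine_api_mappings(aligned_methods, min_support=2):
    """Return {java_api: (csharp_api, support)} for the best co-occurring pair.

    aligned_methods: list of (java_calls, csharp_calls) pairs, one pair per
    client method that exists in both language versions of a project.
    """
    pair_counts = Counter()
    for java_calls, csharp_calls in aligned_methods:
        # Count each candidate pair at most once per aligned method.
        for pair in product(set(java_calls), set(csharp_calls)):
            pair_counts[pair] += 1

    best = {}
    for (java_api, csharp_api), support in pair_counts.items():
        if support < min_support:
            continue  # too little evidence to propose a mapping
        if java_api not in best or support > best[java_api][1]:
            best[java_api] = (csharp_api, support)
    return best


# Toy input: API calls observed in three aligned client methods.
aligned = [
    (["File.exists", "FileReader.read"], ["File.Exists", "StreamReader.Read"]),
    (["File.exists"], ["File.Exists"]),
    (["FileReader.read", "FileReader.close"],
     ["StreamReader.Read", "StreamReader.Close"]),
]
mappings = mine_api_mappings(aligned)
```

In this toy input, `File.exists` and `FileReader.read` each gain a mapping supported by two aligned methods, while `FileReader.close` stays unmapped because it is seen only once.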

(3)  In  order  to  improve  ineffective  warning  prioritization  of  static  analysis  tools,  various  approaches  have  been  proposed  to 
compute  a  ranking  score  for  each  warning.  In  these  approaches,  an  effective  training  set  is  vital  in  exploring  which  factors 
impact the ranking score and how. Manual approaches to building a training set can achieve high effectiveness but suffer
from low efficiency (i.e., high cost), whereas existing automatic approaches suffer from low effectiveness.

Our  research  [ASE  10a]  proposes  an  automatic  approach  for  constructing  an  effective  training  set.  In  our  approach,  we  select 
three  categories  of  impact  factors  as  input  attributes  of  the  training  set,  and  propose  a  new  heuristic  for  identifying  actionable 
warnings to automatically label the training set. Our empirical evaluations show that, with the help of our constructed training set,
the precision of the top 22 warnings for Lucene, the top 20 for ANT, and the top 6 for Spring reaches 100%.

(4)  A  bug-tracking  system  such  as  Bugzilla  contains  bug  reports  (BRs)  collected  from  various  sources  such  as  development 
teams,  testing  teams,  and  end  users.  When  bug  reporters  submit  bug  reports  to  a  bug-tracking  system,  the  bug  reporters  need 
to  label  the  bug  reports  as  security  bug  reports  (SBRs)  or  not,  to  indicate  whether  the  involved  bugs  are  security  problems. 
These SBRs generally deserve higher priority in bug fixing than non-security bug reports (NSBRs). However, in the
bug-reporting  process,  bug  reporters  often  mislabel  SBRs  as  NSBRs  partly  due  to  lack  of  security  domain  knowledge.  This 
mislabeling  could  cause  serious  damage  to  software-system  stakeholders  due  to  the  induced  delay  of  identifying  and  fixing  the 
involved  security  bugs. 

Our research [MSR 10] proposes a new approach that applies text mining to the natural-language descriptions of BRs: it trains a
statistical model on BRs that have already been manually labeled and uses the model to identify SBRs that were manually mislabeled as NSBRs. Security engineers
can  use  the  model  to  automate  the  classification  of  BRs  from  large  bug  databases  to  reduce  the  time  that  they  spend  on 
searching  for  SBRs.  We  evaluated  the  model's  predictions  on  a  large  Cisco  software  system  with  over  ten  million  source  lines 
of  code.  Among  a  sample  of  BRs  that  Cisco  bug  reporters  manually  labeled  as  NSBRs  in  bug  reporting,  our  model  successfully 
classified  a  high  percentage  (78%)  of  the  SBRs  as  verified  by  Cisco  security  engineers,  and  predicted  their  classification  as 
SBRs  with  a  probability  of  at  least  0.98. 
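As a rough illustration of training a statistical model on labeled bug-report text, the sketch below uses a small multinomial naive Bayes classifier with add-one smoothing. Naive Bayes is an assumption made here for illustration, not necessarily the model used in [MSR 10], and the function names and the four training reports are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a description and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_reports):
    """labeled_reports: list of (description, label), label is 'SBR' or 'NSBR'."""
    word_counts = {"SBR": Counter(), "NSBR": Counter()}
    doc_counts = Counter()
    for text, label in labeled_reports:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["SBR"]) | set(word_counts["NSBR"])
    return word_counts, doc_counts, vocab


def prob_sbr(model, text):
    """Estimated probability that a bug report is a security bug report."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_score = {}
    for label in ("SBR", "NSBR"):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        log_score[label] = score
    # Convert the two log scores into a normalized probability.
    top = max(log_score.values())
    weights = {label: math.exp(s - top) for label, s in log_score.items()}
    return weights["SBR"] / (weights["SBR"] + weights["NSBR"])


model = train([
    ("buffer overflow allows remote attacker to execute code", "SBR"),
    ("password stored in plain text in log file", "SBR"),
    ("button label is misaligned on settings page", "NSBR"),
    ("typo in help text on login page", "NSBR"),
])
suspicious = prob_sbr(model, "remote attacker can read password")
cosmetic = prob_sbr(model, "misaligned label on help page")
```

A report labeled NSBR whose estimated probability exceeds a chosen threshold would be flagged for review by a security engineer.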

(5)  To  improve  software  quality,  static  or  dynamic  verification  tools  accept  programming  rules  as  input  and  detect  their 
violations  in  software  as  defects.  As  these  programming  rules  are  often  not  well  documented  in  practice,  previous  work 
developed  various  approaches  that  mine  programming  rules  as  frequent  patterns  from  program  source  code.  Then  these 
approaches  use  static  defect-detection  techniques  to  detect  pattern  violations  in  source  code  under  analysis.  These  existing 
approaches  often  produce  many  false  positives  due  to  various  factors. 


Our research [ASE 09a] proposes a novel approach, called Alattin, to reduce false positives produced by these mining
approaches. Alattin includes a new mining algorithm and a technique for detecting neglected conditions based on our mining
algorithm. Our new mining algorithm mines alternative patterns of the form "P1 or P2", where P1 and P2 are alternative
rules such as condition checks on method arguments or return values related to the same API method. We conduct two
evaluations to show the effectiveness of our Alattin approach. Our evaluation results show that (1) alternative patterns account for
more than 40% of all mined patterns for APIs provided by six open source libraries; (2) the mining of alternative patterns helps
reduce nearly 28% of false positives among detected violations.
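The effect of an alternative pattern on false positives can be seen in a small sketch. It is not Alattin's mining algorithm: it assumes that the condition checks guarding each call site of one API method have already been extracted, and it treats a pair "P1 or P2" as a pattern when neither check is frequent alone but their disjunction is. The function names, the `min_support` value, and the call-site data are hypothetical.

```python
from itertools import combinations


def support(call_sites, checks):
    """Fraction of call sites that perform at least one of `checks`."""
    hits = sum(1 for site in call_sites if site & checks)
    return hits / len(call_sites)


def mine_alternative_patterns(call_sites, min_support=0.8):
    """Return (single patterns, alternative patterns 'P1 or P2')."""
    all_checks = sorted(set().union(*call_sites))
    singles = [c for c in all_checks
               if support(call_sites, {c}) >= min_support]
    alternatives = [
        (p1, p2) for p1, p2 in combinations(all_checks, 2)
        if p1 not in singles and p2 not in singles
        and support(call_sites, {p1, p2}) >= min_support
    ]
    return singles, alternatives


def violations(call_sites, pattern):
    """Indices of call sites that satisfy none of the pattern's alternatives."""
    return [i for i, site in enumerate(call_sites) if not (site & set(pattern))]


# Toy data: checks guarding ten call sites of an iterator's next() method.
sites = [
    {"hasNext()"}, {"hasNext()"}, {"hasNext()"}, {"size() > 0"},
    {"size() > 0"}, {"hasNext()"}, set(), {"hasNext()"},
    {"size() > 0"}, {"hasNext()"},
]
singles, alternatives = mine_alternative_patterns(sites)
flagged_with_alternative = violations(sites, alternatives[0])
flagged_with_single_rule = violations(sites, ("hasNext()",))
```

With the single rule `hasNext()`, four call sites are flagged, three of which are guarded by the alternative check; with the alternative pattern, only the one unguarded call site is flagged.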

(6)  Typically,  software  libraries  provide  API  documentation,  through  which  developers  can  learn  how  to  use  libraries  correctly. 
However,  developers  may  still  write  code  inconsistent  with  API  documentation  and  thus  introduce  bugs,  as  existing  research 
shows  that  many  developers  are  reluctant  to  carefully  read  API  documentation.  To  find  those  bugs,  researchers  have  proposed 
various  detection  approaches  based  on  known  specifications.  To  mine  specifications,  many  approaches  have  been  proposed, 
and  most  of  them  rely  on  existing  client  code.  Consequently,  these  mining  approaches  would  fail  to  mine  specifications  when 
client  code  is  not  available. 

Our research [ASE 09b] proposes an approach, called Doc2Spec, that infers resource specifications from API
documentation. For our approach, we implemented a tool and conducted an evaluation on Javadocs of five libraries. The
results show that our approach infers various specifications with relatively high precision, recall, and F-scores. We further
evaluated  the  usefulness  of  inferred  specifications  through  detecting  bugs  in  open  source  projects.  The  results  show  that 
specifications  inferred  by  Doc2Spec  are  useful  to  detect  real  bugs  in  existing  projects. 

This  paper  received  the  ASE  2009  Best  Paper  Award  and  ACM  SIGSOFT  Distinguished  Paper  Award. 

(7)  An  objective  of  unit  testing  is  to  achieve  high  structural  coverage  of  the  code  under  test.  Achieving  high  structural  coverage 
of  object-oriented  code  requires  desirable  method-call  sequences  that  create  and  mutate  objects.  These  sequences  help 
generate target object states, such as argument or receiver object states (target states, for short), of a method under test.
Automatic  generation  of  sequences  for  achieving  target  states  is  often  challenging  due  to  a  large  search  space  of  possible 
sequences.  On  the  other  hand,  code  bases  using  object  types  (such  as  receiver  or  argument  object  types)  include  sequences 
that  can  be  used  to  assist  automatic  test-generation  approaches  in  achieving  target  states. 

Our  research  [ESEC/FSE  09]  proposes  a  novel  approach,  called  MSeqGen,  that  mines  code  bases  and  extracts  sequences 
related  to  receiver  or  argument  object  types  of  a  method  under  test.  Our  approach  uses  these  extracted  sequences  to  enhance 
two  state-of-the-art  test-generation  approaches:  random  testing  and  dynamic  symbolic  execution.  We  conduct  two  evaluations 
to  show  the  effectiveness  of  our  approach.  Using  sequences  extracted  by  our  approach,  we  show  that  a  random  testing 
approach  achieves  8.7%  (with  a  maximum  of  20.0%  for  one  namespace)  higher  branch  coverage  and  a 
dynamic-symbolic-execution-based  approach  achieves  17.4%  (with  a  maximum  of  22.5%  for  one  namespace)  higher  branch 
coverage  than  without  using  our  approach.  Such  an  improvement  is  significant  as  the  branches  that  are  not  covered  by  these 
state-of-the-art  approaches  are  generally  quite  difficult  to  cover. 
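A minimal sketch of extracting method-call sequences for a target object type is shown below. It is not the MSeqGen implementation, which analyzes real code bases; it assumes each method body in a code base has already been reduced to a list of (receiver type, method name) calls. The function name `extract_sequences` and the toy code base are hypothetical.

```python
from collections import Counter


def extract_sequences(code_base, target_type):
    """Collect call sequences on `target_type`, most frequent first.

    code_base: list of method bodies, each a list of
    (receiver_type, method_name) tuples in call order.
    """
    found = Counter()
    for method_body in code_base:
        # Keep only the calls whose receiver has the target type.
        seq = tuple(name for recv, name in method_body if recv == target_type)
        if seq:
            found[seq] += 1
    return [seq for seq, _ in found.most_common()]


# Toy code base: four method bodies that use Stack and Logger objects.
code_base = [
    [("Stack", "<init>"), ("Stack", "push"), ("Logger", "info"), ("Stack", "push")],
    [("Stack", "<init>"), ("Stack", "push"), ("Stack", "push")],
    [("Logger", "info")],
    [("Stack", "<init>"), ("Stack", "push"), ("Stack", "pop")],
]
sequences = extract_sequences(code_base, "Stack")
```

A test generator could replay such sequences to put a receiver or argument object into a non-trivial state before invoking the method under test, instead of searching the full space of possible sequences.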

(8) Our research [ASE 08a, ASE 09a, ICSE 09a] is the first to expand the mining scope from one or a few local project code
bases (often not sufficient for mining real API properties) to Internet-scale open source repositories for API property mining.
In particular, our research has exploited code search engines such as Google Code Search to collect a sufficiently large number
of client code examples for a specific API under analysis, and to mine API properties from these client code examples.

(9)  Our  research  (described  in  preliminary  work  [ESEC/FSE  07]  for  this  project)  is  the  first  to  exploit  and  adapt  advanced  data 
mining  techniques  (such  as  partial  order  mining)  to  address  unique  mining  requirements  (such  as  expressing  properties  for 
APIs  as  partial  orders),  which  cannot  be  satisfied  by  basic  mining  techniques  commonly  used  by  previous  research. 

(10)  Our  research  [ICSE  09a,  ASE  09a]  has  investigated  complex  patterns  in  common  types  of  API  properties  and  contributed 
new  mining  techniques  to  effectively  mine  these  patterns,  without  being  constrained  by  available  mining  techniques  from  the 
data  mining  community.  In  particular,  we  have  developed  novel  techniques  [ICSE  09a]  that  mine  sequence  association  rules,  a 
new  pattern  proposed  in  our  research,  for  expressing  exception-handling  properties.  We  have  developed  novel  techniques 
[ASE  09a]  that  mine  alternative  patterns,  a  new  pattern  proposed  in  our  research,  for  detecting  neglected  conditions.  Our 
research  has  detected  a  significant  number  of  real  defects  in  open  source  projects  with  these  new  mined  patterns. 

The PI is one of the leading researchers actively promoting the area of mining software engineering data within and even
outside the SE community. He has constructed and maintained the first and only comprehensive bibliography on mining SE
data. He co-presented tutorials or technical briefings on software
analytics or mining software engineering data at top software engineering venues (7 times at ICSE, 1 time at FSE, and 1 time at
ASE) and data mining venues (KDD and ICDM). He will co-organize the 2013 NII Shonan Meeting on Software Analytics:
Principles and Practice. He co-organized the 2007 Dagstuhl Seminar on Mining Programs and Processes.


Our research has also improved the Microsoft Research Pex tool for testing of Object-Oriented (OO) Software. Testing OO
software (such as software written in C#) is critical because OO languages have been increasingly used in developing modern
software systems, and assuring these systems' reliability is very important. In unit testing of OO software, one important and yet
challenging  problem  is  to  generate  desirable  method  sequences  to  produce  specific  receiver  or  argument  object  states  to  find 
bugs  or  achieve  new  code  coverage.  The  search  space  for  such  desirable  method  sequences  is  huge  and  there  existed  no 
previous  techniques  to  effectively  address  this  problem.  We  have  developed  various  novel  techniques  for  improving  the 
effectiveness  of  symbolic  execution  in  method-sequence  generation.  Our  MSeqGen  technique  [ESEC/FSE  09]  is  the  first  to  use 
code  mining  to  gather  already  used  method  sequences  to  guide  method-sequence  generation.  We  have  also  collaborated  with 
Microsoft  Research  on  developing  techniques  for  security  testing  [ASE  10b],  database  application  testing  [ASE  10c],  advanced 
coverage criteria [ICSM 10a], mutation testing [ICSM 10b], and string operations [ASE 09c].

The PI is one of the leading researchers actively promoting the area of automated software testing. He has co-presented
tutorials on automated software testing at top SE venues (ICSE 2009 and 2010, and OOPSLA 2009). He co-organized the 2010
Dagstuhl Seminar on Practical Software Testing: Tool Automation and Human Factors.

Technology  Transfer 


